A translation memory, or TM, is a database that stores so-called "segments", which can be sentences or sentence-like units (headings, titles or elements in a list) that have previously been translated. A translation memory system stores the words, phrases and paragraphs that have already been translated, in order to aid human translators. The translation memory stores the source text and its corresponding translation in language pairs called “translation units”.
Some software programs that use translation memories are known as translation memory managers (TMM).
Translation memories are typically used in conjunction with a dedicated computer assisted translation (CAT) tool, word processing program, terminology management systems, multilingual dictionary, or even raw machine translation output.
A translation memory consists of text segments in a source language and their translations into one or more target languages. These segments can be blocks, paragraphs, sentences, or phrases. Individual words are handled by terminology bases and are not within the domain of TM.
Research indicates that many companies producing multilingual documentation are using translation memory systems. In a survey of language professionals in 2006, 82.5 % out of 874 replies confirmed the use of a TM. Usage of TM correlated with text type characterised by technical terms and simple sentence structure (technical, to a lesser degree marketing and financial), computing skills, and repetitiveness of content[1].
Contents |
The program breaks the source text (the text to be translated) into segments, looks for matches between segments and the source half of previously translated source-target pairs stored in a translation memory, and presents such matching pairs as translation candidates. The translator can accept a candidate, replace it with a fresh translation, or modify it to match the source. In the last two cases, the new or modified translation goes into the database.
Some translation memories systems search for 100% matches only, that is to say that they can only retrieve segments of text that match entries in the database exactly, while others employ fuzzy matching algorithms to retrieve similar segments, which are presented to the translator with differences flagged. It is important to note that typical translation memory systems only search for text in the source segment.
The flexibility and robustness of the matching algorithm largely determine the performance of the translation memory, although for some applications the recall rate of exact matches can be high enough to justify the 100%-match approach.
Segments where no match is found will have to be translated by the translator manually. These newly translated segments are stored in the database where they can be used for future translations as well as repetitions of that segment in the current text.
Translation memories work best on texts which are highly repetitive, such as technical manuals. They are also helpful for translating incremental changes in a previously translated document, corresponding, for example, to minor changes in a new version of a user manual. Traditionally, translation memories have not been considered appropriate for literary or creative texts, for the simple reason that there is so little repetition in the language used. However, others find them of value even for non-repetitive texts, because the database resources created have value for concordance searches to determine appropriate usage of terms, for quality assurance (no empty segments), and the simplification of the review process (source and target segment are always displayed together while translators have to work with two documents in a traditional review environment).
If a translation memory system is used consistently on appropriate texts over a period of time, it can save translators considerable work.
Translation memory managers are most suitable for translating technical documentation and documents containing specialized vocabularies. Their benefits include:
The main problems hindering wider use of translation memory managers include:
The following is a summary of the main functions of a Translation Memory.
This function is used to transfer a text and its translation from a text file to the TM. Import can be done from a raw format, in which an external source text is available for importing into a TM along with its translation. Sometimes the texts have to be reprocessed by the user. There is another format that can be used to import: the native format. This format is the one that uses the TM to save translation memories in a file.
The process of analysis involves the following steps:
Export transfers the text from the TM into an external text file. Import and export should be inverses.
When translating, one of the main purposes of the TM is to retrieve the most useful matches in the memory so that the translator can choose the best one. The TM must show both the source and target text pointing out the identities and differences.
Several different types of matches can be retrieved from a TM.
A TM is updated with a new translation when it has been accepted by the translator. As always in updating a database, there is the question what to do with the previous contents of the database. A TM can be modified by changing or deleting entries in the TM. Some systems allow translators to save multiple translations of the same source segment.
Translation memory tools often provide automatic retrieval and substitution.
Networking enables a group of translators to translate a text together faster than if each was working in isolation, because sentences and phrases translated by one translator are available to the others. Moreover, if translation memories are shared before the final translation, there is an opportunity for mistakes by one translator to be corrected by other team members.
"Text memory" is the basis of the proposed Lisa OSCAR xml:tm standard.[2] Text memory comprises author memory and translation memory.
The unique identifiers are remembered during translation so that the target language document is 'exactly' aligned at the text unit level. If the source document is subsequently modified, then those text units that have not changed can be directly transferred to the new target version of the document without the need for any translator interaction. This is the concept of 'exact' or 'perfect' matching to the translation memory. xml:tm can also provide mechanisms for in-document leveraged and fuzzy matching.
The concept behind translation memories is not recent — university research into the concept began in the late 1970s, and the earliest commercializations became available in the late 1980s — but they became commercially viable only in the late 1990s. Originally translation memory systems stored aligned source and target sentences in a database, from which they could be recalled during translation. The problem with this 'leveraged' approach is that there is no guarantee if the new source language sentence is from the same context as the original database sentence. Therefore all 'leveraged' matches require that a translator reviews the memory match for relevance in the new document. Although cheaper than outright translation, this review still carries a cost.
Translation memory tools from majority of the companies do not support many upcoming languages. Recently Asian countries like India also jumped in to language computing, and there is high demand for translation memories in such developing countries. As most of the CAT software companies are concentrating on legacy languages, nothing much is happening on Asian languages.
One recent development is the concept of 'text memory' in contrast to translation memory.[3] This is also the basis of the proposed LISA OSCAR standard.[4] Text memory within xml:tm comprises 'author memory' and 'translation memory'. Author memory is used to keep track of changes during the authoring cycle. Translation memory uses the information from author memory to implement translation memory matching. Although primarily targeted at XML documents, xml:tm can be used on any document that can be converted to XLIFF[5] format.
Much more powerful than first-generation TMs, they include a linguistic analysis engine, use chunk technology to break down segments into intelligent terminological groups, and automatically generate specific glossaries.
Translation Memory eXchange (TMX) is a standard that enables the interchange of translation memories between translation suppliers. TMX has been adopted by the translation community as the best way of importing and exporting translation memories. The current version is 1.4b - it allows for the recreation of the original source and target documents from the TMX data. An updated version, 2.0, is under development[6]
TermBase eXchange. This LISA standard, which was revised and republished as ISO 30042, allows for the interchange of terminology data including detailed lexical information. The framework for TBX is provided by three ISO standards: ISO 12620, ISO 12200 and ISO 16642. ISO 12620 provides an inventory of well-defined “data categories” with standardized names that function as data element types or as predefined values. ISO 12200 (also known as MARTIF) provides the basis for the core structure of TBX. ISO 16642 (also known as Terminological Markup Framework) includes a structural metamodel for Terminology Markup Languages in general.[7]
Universal Terminology eXchange (UTX) format is a standard specifically designed to be used for user dictionaries of machine translation, but it can be used for general, human-readable glossaries. The purpose of UTX is to accelerate dictionary sharing and reuse by its extremely simple and practical specification.
Segmentation Rules eXchange (SRX) is intended to enhance the TMX standard so that translation memory data that is exchanged between applications can be used more effectively. The ability to specify the segmentation rules that were used in the previous translation may increase the leveraging that can be achieved.
GILT Metrics. GILT stands for (Globalization, Internationalization, Localization, and Translation). The GILT Metrics standard comprises three parts: GMX-V for volume metrics, GMX-C for complexity metrics and GMX-Q for quality metrics. The proposed GILT Metrics standard is tasked with quantifying the workload and quality requirements for any given GILT task.[8]
Open Lexicon Interchange Format. OLIF is an open, XML-compliant standard for the exchange of terminological and lexical data. Although originally intended as a means for the exchange of lexical data between proprietary machine translation lexicons, it has evolved into a more general standard for terminology exchange.[9]
XML Localisation Interchange File Format (XLIFF) is intended to provide a single interchange file format that can be understood by any localization provider. XLIFF is the preferred way of exchanging data in XML format in the translation industry.[10]
Translation Web Services. TransWS specifies the calls needed to use Web services for the submission and retrieval of files and messages relating to localization projects. It is intended as a detailed framework for the automation of much of the current localization process by the use of Web Services.[11]
xml:tm This approach to translation memory is based on the concept of text memory which comprises author and translation memory. xml:tm has been donated to Lisa OSCAR by XML-INTL
Gettext Portable Object format. Though often not regarded as a translation memory format, Gettext PO files are bilingual files that are also used in translation memory processes in the same way translation memories are used. Typically, a PO translation memory system will consist of various separate files in a directory tree structure. Common tools that work with PO files include the GNU Gettext Tools and the Translate Toolkit. Several tools and programs also exist that edit PO files as if they are mere source text files.
Desktop translation memory tools are typically what individual translators use to complete translations. They are a specialized tool for translation in the same way that a word processor is a specialized tool for writing.
Centralized translation memory systems store TM on a central server. They work together with desktop TM and can increase TM match rates by 30-60% more than the TM leverage attained by desktop TM alone. They export prebuilt "translation kits" or "t-kits" to desktop TM tools. A t-kit contains content to be translated pre-segmented on the central server and a subset of the TM containing all applicable TM matches. Centralized TM is usually part of a globalization management system (GMS), which may also include a centralized terminology database (or glossary), a workflow engine, cost estimation, and other tools.